Bug 97343 - AVX2 vectorizer generates extremely strange and slow code for AoSoA complex dot product
Summary: AVX2 vectorizer generates extremely strange and slow code for AoSoA complex dot product
Status: NEW
Alias: None
Product: gcc
Classification: Unclassified
Component: tree-optimization
Version: 10.2.0
Importance: P3 normal
Target Milestone: ---
Assignee: Not yet assigned to anyone
URL:
Keywords: missed-optimization
Depends on:
Blocks: vectorizer
Reported: 2020-10-08 23:09 UTC by Michael_S
Modified: 2020-10-12 13:11 UTC

See Also:
Host:
Target: x86_64-*-* i?86-*-*
Build:
Known to work:
Known to fail:
Last reconfirmed: 2020-10-09 00:00:00


Description Michael_S 2020-10-08 23:09:36 UTC
Let's continue our complex dot product series started here
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96854

This time I have no code generation bugs for your pleasure, just "interesting" optimization issues.
All examples below, unless stated otherwise, were compiled with gcc 10.2 for x86-64 with the following sets
of flags:
set1: -Wall -mavx2 -mfma -march=skylake -O3 -ffast-math -fno-associative-math
set2: -Wall -mavx2 -mfma -march=skylake -O3 -ffast-math

The kernel in question is an example of a complex dot product in the so-called "hybrid AoS" layout, a.k.a. AoSoA:
https://en.wikipedia.org/wiki/AoS_and_SoA#Array_of_Structures_of_Arrays
In my experience, in dense complex linear algebra and similar computational fields it is quite rare for AoSoA *not* to be the optimal internal form.
So, practically, I consider these kernels more important than the AoS kernel presented in bug 96854.

More specifically, the layout can be described as struct { double re[4], im[4]; };
But for the sake of simplicity I omitted the type definition from the code examples and coded it directly over flat arrays of doubles.
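
For clarity, the omitted type would look roughly like this (my own sketch of the intended layout), with the flat indices used below mapping onto it as noted in the comments:

typedef struct { double re[4], im[4]; } cplx4;  /* one AoSoA block = 4 complex numbers */
/* a[c*8+k+0] in the code below corresponds to ((const cplx4*)a)[c].re[k],
   a[c*8+k+4] corresponds to ((const cplx4*)a)[c].im[k]                     */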

Part 1.
void cdot(double* res, const double* a, const double* b, int N)
{
  double acc_re = 0;
  double acc_im = 0;
  for (int c = 0; c < N; ++c) {
    for (int k = 0; k < 4; ++k) {
      acc_re = acc_re + a[c*8+k+0]*b[c*8+k+0] + a[c*8+k+4]*b[c*8+k+4];
      acc_im = acc_im - a[c*8+k+0]*b[c*8+k+4] + a[c*8+k+4]*b[c*8+k+0];
    }
  }
  res[0] = acc_re;
  res[4] = acc_im;
}

That's how we want to code it in an ideal world, letting the compiler take care of the dirty details.
In the less ideal world, gcc is not the only compiler that can't cope with it.
MSVC (-W4 -O2 -fp:fast -arch:AVX2) also can't vectorize it. Even mighty icc generates code that is not quite bad, but somewhat suboptimal.
So, let's let it pass. I don't want to blame gcc for not being smart enough. That's just normal.
Except that when we use set2, the code generated by gcc becomes not just non-smart, but quite crazy.
I am ignoring it in the hope that it will be magically fixed by the change made by Richard Biener on 2020-08-31.
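
For reference, the kind of inner loop I am hoping for from Part 1 looks roughly like the following AVX2 intrinsics sketch (my own code, names included; not the output of any of the compilers mentioned): four contiguous loads and four FMAs per block, no shuffles anywhere.

#include <immintrin.h>

/* horizontal sum of the 4 lanes of a ymm register */
static double hsum4d(__m256d v) {
  __m128d s = _mm_add_pd(_mm256_castpd256_pd128(v), _mm256_extractf128_pd(v, 1));
  return _mm_cvtsd_f64(_mm_add_sd(s, _mm_unpackhi_pd(s, s)));
}

void cdot_ref(double* res, const double* a, const double* b, int N)
{
  __m256d acc_re = _mm256_setzero_pd();
  __m256d acc_im = _mm256_setzero_pd();
  for (int c = 0; c < N; ++c) {
    __m256d are = _mm256_loadu_pd(&a[c*8+0]);   /* re[0..3] of block c */
    __m256d aim = _mm256_loadu_pd(&a[c*8+4]);   /* im[0..3] of block c */
    __m256d bre = _mm256_loadu_pd(&b[c*8+0]);
    __m256d bim = _mm256_loadu_pd(&b[c*8+4]);
    acc_re = _mm256_fmadd_pd(are, bre, acc_re);   /* acc_re += a.re*b.re */
    acc_re = _mm256_fmadd_pd(aim, bim, acc_re);   /* acc_re += a.im*b.im */
    acc_im = _mm256_fmadd_pd(aim, bre, acc_im);   /* acc_im += a.im*b.re */
    acc_im = _mm256_fnmadd_pd(are, bim, acc_im);  /* acc_im -= a.re*b.im */
  }
  res[0] = hsum4d(acc_re);
  res[4] = hsum4d(acc_im);
}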

Part 2.
void cdot(double* res, const double* a, const double* b, int N)
{
  double acc_rere = 0;
  double acc_imim = 0;
  double acc_reim = 0;
  double acc_imre = 0;
  for (int c = 0; c < N; ++c) {
    for (int k = 0; k < 4; ++k) {
      acc_rere += a[c*8+k+0]*b[c*8+k+0];
      acc_imim += a[c*8+k+4]*b[c*8+k+4];
      acc_reim += a[c*8+k+0]*b[c*8+k+4];
      acc_imre += a[c*8+k+4]*b[c*8+k+0];
    }
  }
  res[0] = acc_rere+acc_imim;
  res[4] = acc_imre-acc_reim;
}

We are explaining it to the compiler slowly.
For icc and MSVC that's enough; they understood.
icc generates near-perfect code. I could write it more nicely, but would not expect my variant to be any faster.
MSVC generates a near-perfect inner loop and an epilogue that is not great, but not really much slower.
gcc still doesn't get it. It still implements the 4 accumulators literally, as if -ffast-math were not there.
But, sad as it is, that's still a case of not being smart enough. So, I am not complaining.
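
What I hope the compiler does with Part 2 is simply to keep the four accumulators in four ymm registers, one FMA each per block. A hypothetical intrinsics sketch of that shape (again my own, not compiler output):

#include <immintrin.h>

void cdot4acc(double* res, const double* a, const double* b, int N)
{
  __m256d vrere = _mm256_setzero_pd(), vimim = _mm256_setzero_pd();
  __m256d vreim = _mm256_setzero_pd(), vimre = _mm256_setzero_pd();
  for (int c = 0; c < N; ++c) {
    __m256d are = _mm256_loadu_pd(&a[c*8+0]), aim = _mm256_loadu_pd(&a[c*8+4]);
    __m256d bre = _mm256_loadu_pd(&b[c*8+0]), bim = _mm256_loadu_pd(&b[c*8+4]);
    vrere = _mm256_fmadd_pd(are, bre, vrere);   /* acc_rere lanes */
    vimim = _mm256_fmadd_pd(aim, bim, vimim);   /* acc_imim lanes */
    vreim = _mm256_fmadd_pd(are, bim, vreim);   /* acc_reim lanes */
    vimre = _mm256_fmadd_pd(aim, bre, vimre);   /* acc_imre lanes */
  }
  double re[4], im[4];
  _mm256_storeu_pd(re, _mm256_add_pd(vrere, vimim));
  _mm256_storeu_pd(im, _mm256_sub_pd(vimre, vreim));
  res[0] = re[0]+re[1]+re[2]+re[3];
  res[4] = im[0]+im[1]+im[2]+im[3];
}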

Part 3.
static inline double sum4(double x[]) {
  return x[0]+x[1]+x[2]+x[3];
}
void cdot(double* res, const double* a, const double* b, int N)
{
  double acc_re[4] = {0};
  double acc_im[4] = {0};
  for (int c = 0; c < N; ++c) {
    for (int k = 0; k < 4; ++k) {
      acc_re[k] = acc_re[k] + a[c*8+k+0]*b[c*8+k+0] + a[c*8+k+4]*b[c*8+k+4];
      acc_im[k] = acc_im[k] - a[c*8+k+0]*b[c*8+k+4] + a[c*8+k+4]*b[c*8+k+0];
    }
  }
  res[0] = sum4(acc_re);
  res[4] = sum4(acc_im);
}

An attempt to feed the compiler by teaspoon.
That's not the way I want to write code in an HLL.
icc copes, producing about the same code as in Part 1.
MSVC doesn't understand such a Kunststück (I am sympathetic) and generates literal scalar code with local arrays on the stack.
gcc with set1 is a little better than MSVC - the code is fully scalar, but at least the accumulators are kept in registers.
gcc with set2 is the most interesting. It vectorizes, but how?
Here is the inner loop:

.L3:
	vpermpd	$27, (%r8,%rax), %ymm2
	vpermpd	$27, 32(%rdx,%rax), %ymm3
	vpermpd	$27, (%rdx,%rax), %ymm1
	vpermpd	$27, 32(%r8,%rax), %ymm0
	vmulpd	%ymm2, %ymm1, %ymm6
	vmulpd	%ymm2, %ymm3, %ymm2
	addq	$64, %rax
	vfnmadd132pd	%ymm0, %ymm2, %ymm1
	vfmadd132pd	%ymm3, %ymm6, %ymm0
	vaddpd	%ymm1, %ymm5, %ymm5
	vaddpd	%ymm0, %ymm4, %ymm4
	cmpq	%rcx, %rax
	jne	.L3

What is all this vpermpd business about? Shuffling SIMD lanes around just because it's fun?
The accumulators acc_re[k]/acc_im[k] map directly onto vector lanes and the loads are contiguous, so no shuffles should be needed at all.
That's the first thing I do want to complain about. Not "not smart enough", but too smart for its own good.

And finally,
Part 4.
static inline double sum4(double x[]) {
  return x[0]+x[1]+x[2]+x[3];
}
void cdot(double* res, const double* a, const double* b, int N)
{
  double acc_rere[4] = {0};
  double acc_imim[4] = {0};
  double acc_reim[4] = {0};
  double acc_imre[4] = {0};
  for (int c = 0; c < N; ++c) {
    for (int k = 0; k < 4; ++k) {
      acc_rere[k] += a[c*8+k+0]*b[c*8+k+0];
      acc_imim[k] += a[c*8+k+4]*b[c*8+k+4];
      acc_reim[k] += a[c*8+k+0]*b[c*8+k+4];
      acc_imre[k] += a[c*8+k+4]*b[c*8+k+0];
    }
  }
  double acc_re[4];
  double acc_im[4];
  for (int k = 0; k < 4; ++k) {
    acc_re[k] = acc_rere[k]+acc_imim[k];
    acc_im[k] = acc_imre[k]-acc_reim[k];
  }
  res[0] = sum4(acc_re);
  res[4] = sum4(acc_im);
}

Not just fed by teaspoon, but with the compiler's mouth held open manually, so to speak.
icc, of course, understands and generates pretty much the same good code as in Part 2.
MSVC, of course, does not understand and generates arrays on the stack.

gcc with set2, of course, continues to enjoy the juggling, doubling or tripling it up versus last time.
Inner loop:
.L3:
	vmovupd	(%rdx,%rax), %ymm1
	vmovupd	32(%rdx,%rax), %ymm0
	vmovupd	32(%r8,%rax), %ymm5
	vperm2f128	$49, %ymm1, %ymm0, %ymm3
	vinsertf128	$1, %xmm1, %ymm0, %ymm0
	vpermpd	$221, %ymm0, %ymm10
	vpermpd	$136, %ymm0, %ymm1
	vmovupd	(%r8,%rax), %ymm0
	vpermpd	$136, %ymm3, %ymm9
	vperm2f128	$49, %ymm5, %ymm0, %ymm2
	vinsertf128	$1, %xmm5, %ymm0, %ymm0
	vpermpd	$40, %ymm2, %ymm11
	vpermpd	$125, %ymm0, %ymm5
	vpermpd	$221, %ymm3, %ymm3
	vpermpd	$40, %ymm0, %ymm0
	vpermpd	$125, %ymm2, %ymm2
	addq	$64, %rax
	vfmadd231pd	%ymm2, %ymm3, %ymm8
	vfmadd231pd	%ymm11, %ymm9, %ymm6
	vfmadd231pd	%ymm5, %ymm10, %ymm7
	vfmadd231pd	%ymm0, %ymm1, %ymm4
	cmpq	%rax, %rcx
	jne	.L3

But this time gcc with set1 was the real star of the show. My only reaction is "What?"
.L4:
	vmovupd	0(%r13), %ymm5
	vmovupd	64(%r13), %ymm7
	vmovupd	192(%r13), %ymm4
	vmovupd	128(%r13), %ymm6
	vunpcklpd	32(%r13), %ymm5, %ymm13
	vunpckhpd	32(%r13), %ymm5, %ymm12
	vunpckhpd	96(%r13), %ymm7, %ymm1
	vunpcklpd	96(%r13), %ymm7, %ymm5
	vmovupd	128(%r13), %ymm7
	vunpcklpd	224(%r13), %ymm4, %ymm2
	vunpcklpd	160(%r13), %ymm6, %ymm6
	vunpckhpd	160(%r13), %ymm7, %ymm11
	vunpckhpd	224(%r13), %ymm4, %ymm0
	vpermpd	$216, %ymm13, %ymm13
	vpermpd	$216, %ymm6, %ymm6
	vpermpd	$216, %ymm2, %ymm2
	vpermpd	$216, %ymm5, %ymm5
	vunpcklpd	%ymm2, %ymm6, %ymm4
	vpermpd	$216, %ymm1, %ymm1
	vpermpd	$216, %ymm11, %ymm11
	vunpcklpd	%ymm5, %ymm13, %ymm9
	vpermpd	$216, %ymm12, %ymm12
	vpermpd	$216, %ymm0, %ymm0
	vpermpd	$216, %ymm4, %ymm3
	vpermpd	$216, %ymm9, %ymm9
	vunpckhpd	%ymm2, %ymm6, %ymm4
	vunpckhpd	%ymm5, %ymm13, %ymm5
	vunpcklpd	%ymm1, %ymm12, %ymm6
	vunpcklpd	%ymm0, %ymm11, %ymm2
	vunpckhpd	%ymm1, %ymm12, %ymm12
	vunpckhpd	%ymm0, %ymm11, %ymm0
	vpermpd	$216, %ymm12, %ymm1
	vunpcklpd	%ymm3, %ymm9, %ymm11
	vpermpd	$216, %ymm5, %ymm5
	vpermpd	$216, %ymm4, %ymm4
	vpermpd	$216, %ymm0, %ymm0
	vmovupd	64(%r12), %ymm15
	vpermpd	$216, %ymm6, %ymm6
	vpermpd	$216, %ymm11, %ymm8
	vpermpd	$216, %ymm2, %ymm2
	vunpcklpd	%ymm4, %ymm5, %ymm11
	vunpckhpd	%ymm3, %ymm9, %ymm9
	vunpckhpd	%ymm4, %ymm5, %ymm4
	vunpcklpd	%ymm0, %ymm1, %ymm5
	vpermpd	$216, %ymm9, %ymm3
	vunpcklpd	%ymm2, %ymm6, %ymm9
	vunpckhpd	%ymm2, %ymm6, %ymm2
	vpermpd	$216, %ymm5, %ymm6
	vunpcklpd	96(%r12), %ymm15, %ymm12
	vunpckhpd	%ymm0, %ymm1, %ymm0
	vmovupd	%ymm6, 64(%rsp)
	vunpckhpd	96(%r12), %ymm15, %ymm6
	vmovupd	128(%r12), %ymm15
	vpermpd	$216, %ymm0, %ymm5
	vpermpd	$216, %ymm9, %ymm7
	vmovupd	(%r12), %ymm0
	vunpckhpd	160(%r12), %ymm15, %ymm9
	vmovupd	%ymm5, 96(%rsp)
	vunpcklpd	160(%r12), %ymm15, %ymm5
	vmovupd	192(%r12), %ymm15
	vunpcklpd	32(%r12), %ymm0, %ymm1
	vpermpd	$216, %ymm9, %ymm14
	vunpcklpd	224(%r12), %ymm15, %ymm9
	vunpckhpd	224(%r12), %ymm15, %ymm13
	vunpckhpd	32(%r12), %ymm0, %ymm0
	vpermpd	$216, %ymm12, %ymm12
	vpermpd	$216, %ymm9, %ymm9
	vpermpd	$216, %ymm1, %ymm1
	vpermpd	$216, %ymm5, %ymm5
	vpermpd	$216, %ymm6, %ymm6
	vunpcklpd	%ymm12, %ymm1, %ymm10
	vpermpd	$216, %ymm0, %ymm0
	vpermpd	$216, %ymm13, %ymm13
	vunpckhpd	%ymm12, %ymm1, %ymm1
	vunpcklpd	%ymm9, %ymm5, %ymm12
	vpermpd	$216, %ymm12, %ymm12
	vpermpd	$216, %ymm10, %ymm10
	vunpckhpd	%ymm9, %ymm5, %ymm5
	vunpcklpd	%ymm6, %ymm0, %ymm9
	vunpckhpd	%ymm6, %ymm0, %ymm0
	vunpcklpd	%ymm13, %ymm14, %ymm6
	vunpckhpd	%ymm13, %ymm14, %ymm13
	vpermpd	$216, %ymm13, %ymm14
	vunpcklpd	%ymm12, %ymm10, %ymm13
	vpermpd	$216, %ymm13, %ymm13
	vmulpd	%ymm13, %ymm8, %ymm15
	vpermpd	$216, %ymm5, %ymm5
	vpermpd	$216, %ymm6, %ymm6
	vpermpd	$216, %ymm1, %ymm1
	vpermpd	$216, %ymm9, %ymm9
	vpermpd	$216, %ymm0, %ymm0
	vunpckhpd	%ymm12, %ymm10, %ymm10
	vunpcklpd	%ymm6, %ymm9, %ymm12
	vunpckhpd	%ymm6, %ymm9, %ymm9
	vunpcklpd	%ymm5, %ymm1, %ymm6
	vunpckhpd	%ymm5, %ymm1, %ymm1
	vunpcklpd	%ymm14, %ymm0, %ymm5
	vunpckhpd	%ymm14, %ymm0, %ymm0
	vpermpd	$216, %ymm0, %ymm0
	vmovupd	%ymm0, 160(%rsp)
	vmovq	%r9, %xmm0
	vaddsd	%xmm15, %xmm0, %xmm0
	vunpckhpd	%xmm15, %xmm15, %xmm14
	vpermpd	$216, %ymm10, %ymm10
	vaddsd	%xmm14, %xmm0, %xmm0
	vextractf128	$0x1, %ymm15, %xmm14
	vmulpd	%ymm10, %ymm8, %ymm8
	vaddsd	%xmm14, %xmm0, %xmm15
	vunpckhpd	%xmm14, %xmm14, %xmm14
	vpermpd	$216, %ymm12, %ymm12
	vaddsd	%xmm14, %xmm15, %xmm0
	vmulpd	%ymm10, %ymm3, %ymm15
	vunpckhpd	%xmm8, %xmm8, %xmm10
	vmovq	%xmm0, %r9
	vmovq	%rcx, %xmm0
	vmulpd	%ymm13, %ymm3, %ymm3
	vaddsd	%xmm15, %xmm0, %xmm0
	vunpckhpd	%xmm15, %xmm15, %xmm14
	vextractf128	$0x1, %ymm15, %xmm15
	vaddsd	%xmm14, %xmm0, %xmm14
	vpermpd	$216, %ymm1, %ymm1
	vmovupd	%ymm1, 128(%rsp)
	vaddsd	%xmm15, %xmm14, %xmm14
	vunpckhpd	%xmm15, %xmm15, %xmm15
	vpermpd	$216, %ymm2, %ymm2
	vaddsd	%xmm15, %xmm14, %xmm0
	vmovsd	56(%rsp), %xmm14
	vpermpd	$216, %ymm9, %ymm9
	vaddsd	%xmm8, %xmm14, %xmm14
	vextractf128	$0x1, %ymm8, %xmm8
	vmovq	%xmm0, %rcx
	vaddsd	%xmm10, %xmm14, %xmm10
	vpermpd	$216, %ymm6, %ymm6
	vpermpd	$216, %ymm11, %ymm11
	vaddsd	%xmm8, %xmm10, %xmm10
	vunpckhpd	%xmm8, %xmm8, %xmm8
	vpermpd	$216, %ymm4, %ymm4
	vaddsd	%xmm8, %xmm10, %xmm0
	vmovsd	48(%rsp), %xmm10
	vunpckhpd	%xmm3, %xmm3, %xmm8
	vaddsd	%xmm3, %xmm10, %xmm10
	vextractf128	$0x1, %ymm3, %xmm3
	vmovsd	%xmm0, 56(%rsp)
	vaddsd	%xmm8, %xmm10, %xmm8
	vmulpd	%ymm12, %ymm7, %ymm10
	vmulpd	%ymm9, %ymm7, %ymm7
	vaddsd	%xmm3, %xmm8, %xmm8
	vunpckhpd	%xmm3, %xmm3, %xmm3
	vpermpd	$216, %ymm5, %ymm5
	vaddsd	%xmm3, %xmm8, %xmm0
	vunpckhpd	%xmm10, %xmm10, %xmm3
	addq	$256, %r12
	vmovsd	%xmm0, 48(%rsp)
	vmovq	%rdi, %xmm0
	vaddsd	%xmm10, %xmm0, %xmm8
	vextractf128	$0x1, %ymm10, %xmm10
	vmovq	%rbx, %xmm0
	vaddsd	%xmm3, %xmm8, %xmm3
	vmulpd	%ymm9, %ymm2, %ymm8
	vmulpd	%ymm12, %ymm2, %ymm2
	vaddsd	%xmm10, %xmm3, %xmm3
	vunpckhpd	%xmm10, %xmm10, %xmm10
	addq	$256, %r13
	vaddsd	%xmm10, %xmm3, %xmm1
	vaddsd	%xmm8, %xmm0, %xmm10
	vunpckhpd	%xmm8, %xmm8, %xmm3
	vextractf128	$0x1, %ymm8, %xmm8
	vaddsd	%xmm3, %xmm10, %xmm3
	vmovq	%xmm1, %rdi
	vmovq	%r11, %xmm1
	vaddsd	%xmm8, %xmm3, %xmm3
	vunpckhpd	%xmm8, %xmm8, %xmm8
	vmovq	%r10, %xmm0
	vaddsd	%xmm8, %xmm3, %xmm3
	vmovsd	40(%rsp), %xmm8
	vaddsd	%xmm7, %xmm8, %xmm8
	vmovq	%xmm3, %rbx
	vunpckhpd	%xmm7, %xmm7, %xmm3
	vaddsd	%xmm3, %xmm8, %xmm3
	vextractf128	$0x1, %ymm7, %xmm7
	vaddsd	%xmm7, %xmm3, %xmm3
	vunpckhpd	%xmm7, %xmm7, %xmm7
	vaddsd	%xmm7, %xmm3, %xmm3
	vmovsd	32(%rsp), %xmm7
	vaddsd	%xmm2, %xmm7, %xmm7
	vmovsd	%xmm3, 40(%rsp)
	vunpckhpd	%xmm2, %xmm2, %xmm3
	vaddsd	%xmm3, %xmm7, %xmm3
	vextractf128	$0x1, %ymm2, %xmm2
	vmulpd	%ymm6, %ymm11, %ymm7
	vaddsd	%xmm2, %xmm3, %xmm3
	vunpckhpd	%xmm2, %xmm2, %xmm2
	vaddsd	%xmm2, %xmm3, %xmm2
	vaddsd	%xmm7, %xmm1, %xmm3
	vmovupd	128(%rsp), %ymm1
	vmovsd	%xmm2, 32(%rsp)
	vunpckhpd	%xmm7, %xmm7, %xmm2
	vaddsd	%xmm2, %xmm3, %xmm2
	vextractf128	$0x1, %ymm7, %xmm7
	vmulpd	%ymm1, %ymm4, %ymm3
	vaddsd	%xmm7, %xmm2, %xmm2
	vunpckhpd	%xmm7, %xmm7, %xmm7
	vmulpd	%ymm1, %ymm11, %ymm1
	vaddsd	%xmm7, %xmm2, %xmm2
	vaddsd	%xmm3, %xmm0, %xmm7
	vmulpd	%ymm6, %ymm4, %ymm4
	vmovq	%xmm2, %r11
	vunpckhpd	%xmm3, %xmm3, %xmm2
	vaddsd	%xmm2, %xmm7, %xmm2
	vextractf128	$0x1, %ymm3, %xmm3
	vmovupd	64(%rsp), %ymm6
	vaddsd	%xmm3, %xmm2, %xmm2
	vunpckhpd	%xmm3, %xmm3, %xmm3
	vmovupd	96(%rsp), %ymm7
	vaddsd	%xmm3, %xmm2, %xmm2
	vmovsd	24(%rsp), %xmm3
	vmovupd	160(%rsp), %ymm0
	vaddsd	%xmm1, %xmm3, %xmm3
	vmovq	%xmm2, %r10
	vunpckhpd	%xmm1, %xmm1, %xmm2
	vaddsd	%xmm2, %xmm3, %xmm2
	vextractf128	$0x1, %ymm1, %xmm1
	vmovq	%rbp, %xmm3
	vaddsd	%xmm1, %xmm2, %xmm2
	vunpckhpd	%xmm1, %xmm1, %xmm1
	vaddsd	%xmm1, %xmm2, %xmm2
	vunpckhpd	%xmm4, %xmm4, %xmm1
	vmovsd	%xmm2, 24(%rsp)
	vmovsd	16(%rsp), %xmm2
	vaddsd	%xmm4, %xmm2, %xmm2
	vextractf128	$0x1, %ymm4, %xmm4
	vaddsd	%xmm1, %xmm2, %xmm1
	vaddsd	%xmm4, %xmm1, %xmm1
	vunpckhpd	%xmm4, %xmm4, %xmm4
	vaddsd	%xmm4, %xmm1, %xmm4
	vmovsd	%xmm4, 16(%rsp)
	vmulpd	%ymm6, %ymm5, %ymm4
	vmulpd	%ymm7, %ymm5, %ymm5
	vaddsd	%xmm4, %xmm3, %xmm1
	vunpckhpd	%xmm4, %xmm4, %xmm2
	vmovq	%rsi, %xmm3
	vaddsd	%xmm2, %xmm1, %xmm2
	vextractf128	$0x1, %ymm4, %xmm1
	vaddsd	%xmm1, %xmm2, %xmm2
	vunpckhpd	%xmm1, %xmm1, %xmm1
	vaddsd	%xmm1, %xmm2, %xmm4
	vmovq	%xmm4, %rbp
	vmulpd	%ymm0, %ymm7, %ymm4
	vmulpd	%ymm0, %ymm6, %ymm0
	vaddsd	%xmm4, %xmm3, %xmm1
	vunpckhpd	%xmm4, %xmm4, %xmm2
	vaddsd	%xmm2, %xmm1, %xmm2
	vextractf128	$0x1, %ymm4, %xmm1
	vaddsd	%xmm1, %xmm2, %xmm2
	vunpckhpd	%xmm1, %xmm1, %xmm1
	vaddsd	%xmm1, %xmm2, %xmm4
	vmovsd	8(%rsp), %xmm2
	vunpckhpd	%xmm0, %xmm0, %xmm1
	vaddsd	%xmm0, %xmm2, %xmm2
	vextractf128	$0x1, %ymm0, %xmm0
	vmovq	%xmm4, %rsi
	vaddsd	%xmm1, %xmm2, %xmm1
	vaddsd	%xmm0, %xmm1, %xmm1
	vunpckhpd	%xmm0, %xmm0, %xmm0
	vaddsd	%xmm0, %xmm1, %xmm6
	vmovsd	(%rsp), %xmm1
	vunpckhpd	%xmm5, %xmm5, %xmm0
	vaddsd	%xmm5, %xmm1, %xmm1
	vextractf128	$0x1, %ymm5, %xmm5
	vmovsd	%xmm6, 8(%rsp)
	vaddsd	%xmm0, %xmm1, %xmm0
	vaddsd	%xmm5, %xmm0, %xmm0
	vunpckhpd	%xmm5, %xmm5, %xmm5
	vaddsd	%xmm5, %xmm0, %xmm5
	vmovsd	%xmm5, (%rsp)
	cmpq	%rax, %r12
	jne	.L4
	movl	%r15d, %r12d
	andl	$-4, %r12d
	movl	%r12d, %edx
	cmpl	%r12d, %r15d
	je	.L5
.L3:
	movl	%r15d, %eax
	subl	%r12d, %eax
	cmpl	$1, %eax
	je	.L6
	salq	$6, %r12
	leaq	(%r14,%r12), %r13
	vmovupd	16(%r13), %xmm3
	vmovupd	48(%r13), %xmm0
	vmovupd	64(%r13), %xmm8
	vmovupd	112(%r13), %xmm10
	vmovupd	0(%r13), %xmm4
	vmovupd	32(%r13), %xmm2
	vmovupd	80(%r13), %xmm6
	vmovupd	96(%r13), %xmm1
	vunpcklpd	%xmm3, %xmm4, %xmm5
	vunpckhpd	%xmm3, %xmm4, %xmm4
	vunpcklpd	%xmm0, %xmm2, %xmm3
	vunpckhpd	%xmm0, %xmm2, %xmm2
	vunpcklpd	%xmm6, %xmm8, %xmm0
	vunpckhpd	%xmm6, %xmm8, %xmm6
	vunpcklpd	%xmm10, %xmm1, %xmm8
	vunpckhpd	%xmm10, %xmm1, %xmm1
	vunpcklpd	%xmm3, %xmm5, %xmm11
	vunpcklpd	%xmm2, %xmm4, %xmm10
	vunpckhpd	%xmm3, %xmm5, %xmm3
	vunpckhpd	%xmm2, %xmm4, %xmm2
	vunpcklpd	%xmm8, %xmm0, %xmm5
	vunpcklpd	%xmm1, %xmm6, %xmm4
	vunpckhpd	%xmm8, %xmm0, %xmm0
	vunpckhpd	%xmm1, %xmm6, %xmm1
	addq	%r8, %r12
	vunpcklpd	%xmm5, %xmm11, %xmm8
	vunpckhpd	%xmm0, %xmm3, %xmm7
	vunpckhpd	%xmm5, %xmm11, %xmm11
	vunpckhpd	%xmm1, %xmm2, %xmm5
	vmovupd	64(%r12), %xmm12
	vunpcklpd	%xmm1, %xmm2, %xmm6
	vmovupd	80(%r12), %xmm9
	vmovupd	48(%r12), %xmm1
	vmovupd	96(%r12), %xmm2
	vunpcklpd	%xmm4, %xmm10, %xmm14
	vunpcklpd	%xmm0, %xmm3, %xmm13
	vunpckhpd	%xmm4, %xmm10, %xmm10
	vmovupd	32(%r12), %xmm3
	vmovupd	16(%r12), %xmm4
	vmovapd	%xmm7, 64(%rsp)
	vmovapd	%xmm5, 96(%rsp)
	vmovupd	112(%r12), %xmm7
	vmovupd	(%r12), %xmm5
	movl	%eax, %r12d
	vunpcklpd	%xmm4, %xmm5, %xmm15
	vunpckhpd	%xmm4, %xmm5, %xmm5
	vunpcklpd	%xmm1, %xmm3, %xmm4
	vunpckhpd	%xmm1, %xmm3, %xmm3
	vunpcklpd	%xmm9, %xmm12, %xmm1
	vunpckhpd	%xmm9, %xmm12, %xmm9
	vunpcklpd	%xmm7, %xmm2, %xmm12
	vunpckhpd	%xmm7, %xmm2, %xmm2
	vunpcklpd	%xmm4, %xmm15, %xmm7
	vunpckhpd	%xmm4, %xmm15, %xmm15
	vunpcklpd	%xmm12, %xmm1, %xmm4
	vunpckhpd	%xmm12, %xmm1, %xmm1
	vunpcklpd	%xmm3, %xmm5, %xmm12
	vunpckhpd	%xmm3, %xmm5, %xmm5
	vunpcklpd	%xmm2, %xmm9, %xmm3
	vunpckhpd	%xmm2, %xmm9, %xmm2
	vunpcklpd	%xmm4, %xmm7, %xmm9
	vunpckhpd	%xmm1, %xmm15, %xmm0
	vunpckhpd	%xmm4, %xmm7, %xmm4
	vunpcklpd	%xmm3, %xmm12, %xmm7
	vunpckhpd	%xmm3, %xmm12, %xmm3
	vunpcklpd	%xmm1, %xmm15, %xmm12
	vunpcklpd	%xmm2, %xmm5, %xmm15
	vunpckhpd	%xmm2, %xmm5, %xmm2
	vmulpd	%xmm9, %xmm8, %xmm5
	vmovapd	%xmm0, 128(%rsp)
	vmovq	%r9, %xmm0
	andl	$-2, %r12d
	addl	%r12d, %edx
	vaddsd	%xmm5, %xmm0, %xmm0
	vunpckhpd	%xmm5, %xmm5, %xmm5
	vaddsd	%xmm5, %xmm0, %xmm1
	vmulpd	%xmm4, %xmm11, %xmm5
	vmulpd	%xmm4, %xmm8, %xmm4
	vmovq	%xmm1, %r9
	vmovq	%rcx, %xmm1
	vmulpd	%xmm9, %xmm11, %xmm11
	vaddsd	%xmm5, %xmm1, %xmm1
	vunpckhpd	%xmm5, %xmm5, %xmm5
	vmulpd	%xmm7, %xmm14, %xmm9
	vaddsd	%xmm5, %xmm1, %xmm1
	vmovsd	56(%rsp), %xmm5
	vmulpd	%xmm3, %xmm10, %xmm8
	vaddsd	%xmm4, %xmm5, %xmm5
	vunpckhpd	%xmm4, %xmm4, %xmm4
	vmovq	%xmm1, %rcx
	vaddsd	%xmm4, %xmm5, %xmm4
	vmovq	%rdi, %xmm1
	vmulpd	%xmm3, %xmm14, %xmm14
	vmovsd	%xmm4, 56(%rsp)
	vmovsd	48(%rsp), %xmm4
	vmovq	%rbx, %xmm0
	vaddsd	%xmm11, %xmm4, %xmm4
	vunpckhpd	%xmm11, %xmm11, %xmm11
	vmovsd	40(%rsp), %xmm3
	vaddsd	%xmm11, %xmm4, %xmm4
	vmulpd	%xmm7, %xmm10, %xmm10
	vaddsd	%xmm14, %xmm3, %xmm3
	vmovsd	%xmm4, 48(%rsp)
	vaddsd	%xmm9, %xmm1, %xmm4
	vunpckhpd	%xmm9, %xmm9, %xmm9
	vunpckhpd	%xmm14, %xmm14, %xmm14
	vaddsd	%xmm9, %xmm4, %xmm4
	vmovapd	128(%rsp), %xmm5
	vmovapd	64(%rsp), %xmm11
	vmovq	%xmm4, %rdi
	vaddsd	%xmm8, %xmm0, %xmm4
	vunpckhpd	%xmm8, %xmm8, %xmm8
	vmovsd	24(%rsp), %xmm1
	vaddsd	%xmm8, %xmm4, %xmm4
	vmovsd	16(%rsp), %xmm0
	vmovq	%xmm4, %rbx
	vaddsd	%xmm14, %xmm3, %xmm4
	vmovsd	32(%rsp), %xmm3
	vaddsd	%xmm10, %xmm3, %xmm3
	vunpckhpd	%xmm10, %xmm10, %xmm10
	vmovsd	%xmm4, 40(%rsp)
	vaddsd	%xmm10, %xmm3, %xmm7
	vmulpd	%xmm12, %xmm13, %xmm3
	vmulpd	%xmm5, %xmm13, %xmm13
	vmovsd	%xmm7, 32(%rsp)
	vmovq	%r11, %xmm7
	vmulpd	%xmm11, %xmm12, %xmm12
	vaddsd	%xmm3, %xmm7, %xmm4
	vunpckhpd	%xmm3, %xmm3, %xmm3
	vaddsd	%xmm13, %xmm1, %xmm1
	vaddsd	%xmm3, %xmm4, %xmm7
	vmulpd	%xmm5, %xmm11, %xmm3
	vunpckhpd	%xmm13, %xmm13, %xmm13
	vmovq	%xmm7, %r11
	vmovq	%r10, %xmm7
	vaddsd	%xmm12, %xmm0, %xmm0
	vaddsd	%xmm3, %xmm7, %xmm4
	vunpckhpd	%xmm3, %xmm3, %xmm3
	vunpckhpd	%xmm12, %xmm12, %xmm12
	vaddsd	%xmm3, %xmm4, %xmm7
	vaddsd	%xmm13, %xmm1, %xmm4
	vmovq	%xmm7, %r10
	vmovsd	%xmm4, 24(%rsp)
	vaddsd	%xmm12, %xmm0, %xmm4
	vmulpd	%xmm15, %xmm6, %xmm0
	vmovq	%rbp, %xmm7
	vmovsd	%xmm4, 16(%rsp)
	vmovapd	96(%rsp), %xmm5
	vaddsd	%xmm0, %xmm7, %xmm1
	vunpckhpd	%xmm0, %xmm0, %xmm0
	vmovq	%rsi, %xmm7
	vaddsd	%xmm0, %xmm1, %xmm4
	vmulpd	%xmm5, %xmm2, %xmm0
	vmulpd	%xmm2, %xmm6, %xmm2
	vmovq	%xmm4, %rbp
	vmulpd	%xmm5, %xmm15, %xmm15
	vaddsd	%xmm0, %xmm7, %xmm1
	vunpckhpd	%xmm0, %xmm0, %xmm0
	vaddsd	%xmm0, %xmm1, %xmm4
	vmovsd	8(%rsp), %xmm0
	vaddsd	%xmm2, %xmm0, %xmm0
	vunpckhpd	%xmm2, %xmm2, %xmm2
	vmovq	%xmm4, %rsi
	vaddsd	%xmm2, %xmm0, %xmm6
	vmovsd	(%rsp), %xmm0
	vaddsd	%xmm15, %xmm0, %xmm0
	vunpckhpd	%xmm15, %xmm15, %xmm15
	vmovsd	%xmm6, 8(%rsp)
	vaddsd	%xmm15, %xmm0, %xmm5
	vmovsd	%xmm5, (%rsp)
	cmpl	%r12d, %eax
	je	.L5
Comment 1 Richard Biener 2020-10-09 06:41:30 UTC
All below for Part 2.

Without -ffast-math you are seeing GCC using in-order reductions now while
with -ffast-math the vectorizer gets a bit confused about reassociations done
before, for me producing

.L3:
        vmovupd 32(%rsi,%rax), %ymm3
        vmovupd (%rdx,%rax), %ymm7
        vinsertf128     $1, (%rsi,%rax), %ymm3, %ymm0
        vinsertf128     $1, 32(%rdx,%rax), %ymm7, %ymm2
        vmovupd 32(%rsi,%rax), %ymm5
        vpermpd $136, %ymm0, %ymm4
        vpermpd $40, %ymm2, %ymm7
        vpermpd $221, %ymm0, %ymm1
        vpermpd $125, %ymm2, %ymm3
        vperm2f128      $49, (%rsi,%rax), %ymm5, %ymm0
        vmovupd (%rdx,%rax), %ymm2
        vperm2f128      $49, 32(%rdx,%rax), %ymm2, %ymm2
        addq    $64, %rax
        vpermpd $136, %ymm0, %ymm5
        vpermpd $221, %ymm0, %ymm0
        vpermpd $40, %ymm2, %ymm8
        vpermpd $125, %ymm2, %ymm2
        vmulpd  %ymm8, %ymm5, %ymm5
        vmulpd  %ymm2, %ymm0, %ymm0
        vfmadd132pd     %ymm3, %ymm5, %ymm1
        vfmadd231pd     %ymm7, %ymm4, %ymm0
        vaddpd  %ymm0, %ymm1, %ymm0
        vaddpd  %ymm0, %ymm6, %ymm6
        cmpq    %rcx, %rax
        jne     .L3

With -ffast-math vs. non-ffast-math we're using an SLP reduction vs. 4 reduction
chains, and this SLP reduction ends up looking like

t5.c:7:21: note:   Vectorizing SLP tree:
t5.c:7:21: note:   node 0x4100c20 (max_nunits=4, refcnt=2)
t5.c:7:21: note:        stmt 0 acc_imre_158 = acc_imre_3 + _34;
t5.c:7:21: note:        stmt 1 acc_reim_156 = acc_reim_1 + _8;
t5.c:7:21: note:        stmt 2 acc_imim_154 = _21 + acc_imim_35;
t5.c:7:21: note:        stmt 3 acc_rere_146 = _11 + acc_rere_29;
t5.c:7:21: note:        children 0x3f272e0 0x4100bb0
t5.c:7:21: note:   node 0x3f272e0 (max_nunits=4, refcnt=1)
t5.c:7:21: note:        stmt 0 acc_imre_3 = PHI <acc_imre_158(7), 0.0(8)>
t5.c:7:21: note:        stmt 1 acc_reim_1 = PHI <acc_reim_156(7), 0.0(8)>
t5.c:7:21: note:        stmt 2 acc_imim_35 = PHI <acc_imim_154(7), 0.0(8)>
t5.c:7:21: note:        stmt 3 acc_rere_29 = PHI <acc_rere_146(7), 0.0(8)>
t5.c:7:21: note:        children 0x4100c20
t5.c:7:21: note:   node 0x4100bb0 (max_nunits=4, refcnt=1)
t5.c:7:21: note:        stmt 0 _34 = _36 + _157;
t5.c:7:21: note:        stmt 1 _8 = _30 + _155;
t5.c:7:21: note:        stmt 2 _21 = _15 + _153;
t5.c:7:21: note:        stmt 3 _11 = _6 + _145;
t5.c:7:21: note:        children 0x4100920 0x4100b40
t5.c:7:21: note:   node 0x4100920 (max_nunits=4, refcnt=1)
t5.c:7:21: note:        stmt 0 _36 = _37 + _73;
t5.c:7:21: note:        stmt 1 _30 = _32 + _71;
t5.c:7:21: note:        stmt 2 _15 = _10 + _69;
t5.c:7:21: note:        stmt 3 _6 = _31 + _61;
t5.c:7:21: note:        children 0x41004e0 0x41008b0
t5.c:7:21: note:   node 0x41004e0 (max_nunits=4, refcnt=1)
t5.c:7:21: note:        stmt 0 _37 = _101 + _129;
t5.c:7:21: note:        stmt 1 _32 = _99 + _127;
t5.c:7:21: note:        stmt 2 _10 = _97 + _125;
t5.c:7:21: note:        stmt 3 _31 = _89 + _117;
t5.c:7:21: note:        children 0x3f2a550 0x3f28700
t5.c:7:21: note:   node 0x3f2a550 (max_nunits=4, refcnt=1)
t5.c:7:21: note:        stmt 0 _101 = _88 * _94;
t5.c:7:21: note:        stmt 1 _99 = _86 * _96;
t5.c:7:21: note:        stmt 2 _97 = _94 * _96;
t5.c:7:21: note:        stmt 3 _89 = _86 * _88;
t5.c:7:21: note:        children 0x40b6990 0x3f29e00
t5.c:7:21: note:   node 0x40b6990 (max_nunits=4, refcnt=1)
t5.c:7:21: note:        stmt 0 _88 = *_87;
t5.c:7:21: note:        stmt 1 _96 = *_95;
t5.c:7:21: note:        stmt 2 _96 = *_95;
t5.c:7:21: note:        stmt 3 _88 = *_87;
t5.c:7:21: note:        load permutation { 1 5 5 1 }
t5.c:7:21: note:   node 0x3f29e00 (max_nunits=4, refcnt=1)
t5.c:7:21: note:        stmt 0 _94 = *_93;
t5.c:7:21: note:        stmt 1 _86 = *_85;
t5.c:7:21: note:        stmt 2 _94 = *_93;
t5.c:7:21: note:        stmt 3 _86 = *_85;
t5.c:7:21: note:        load permutation { 5 1 5 1 }
t5.c:7:21: note:   node 0x3f28700 (max_nunits=4, refcnt=1)
t5.c:7:21: note:        stmt 0 _129 = _116 * _122;
t5.c:7:21: note:        stmt 1 _127 = _114 * _124;
t5.c:7:21: note:        stmt 2 _125 = _122 * _124;
t5.c:7:21: note:        stmt 3 _117 = _114 * _116;
t5.c:7:21: note:        children 0x3f287e0 0x3f28770
t5.c:7:21: note:   node 0x3f287e0 (max_nunits=4, refcnt=1)
t5.c:7:21: note:        stmt 0 _116 = *_115;
t5.c:7:21: note:        stmt 1 _124 = *_123;
t5.c:7:21: note:        stmt 2 _124 = *_123;
t5.c:7:21: note:        stmt 3 _116 = *_115;
t5.c:7:21: note:        load permutation { 2 6 6 2 }
t5.c:7:21: note:   node 0x3f28770 (max_nunits=4, refcnt=1)
t5.c:7:21: note:        stmt 0 _122 = *_121;
t5.c:7:21: note:        stmt 1 _114 = *_113;
t5.c:7:21: note:        stmt 2 _122 = *_121;
t5.c:7:21: note:        stmt 3 _114 = *_113;
t5.c:7:21: note:        load permutation { 6 2 6 2 }
t5.c:7:21: note:   node 0x41008b0 (max_nunits=4, refcnt=1)
t5.c:7:21: note:        stmt 0 _73 = _60 * _66;
t5.c:7:21: note:        stmt 1 _71 = _58 * _68;
t5.c:7:21: note:        stmt 2 _69 = _66 * _68;
t5.c:7:21: note:        stmt 3 _61 = _58 * _60;
t5.c:7:21: note:        children 0x4100290 0x4100810
t5.c:7:21: note:   node 0x4100290 (max_nunits=4, refcnt=1)
t5.c:7:21: note:        stmt 0 _60 = *_59;
t5.c:7:21: note:        stmt 1 _68 = *_67;
t5.c:7:21: note:        stmt 2 _68 = *_67;
t5.c:7:21: note:        stmt 3 _60 = *_59;
t5.c:7:21: note:        load permutation { 0 4 4 0 }
t5.c:7:21: note:   node 0x4100810 (max_nunits=4, refcnt=1)
t5.c:7:21: note:        stmt 0 _66 = *_65;
t5.c:7:21: note:        stmt 1 _58 = *_57;
t5.c:7:21: note:        stmt 2 _66 = *_65;
t5.c:7:21: note:        stmt 3 _58 = *_57;
t5.c:7:21: note:        load permutation { 4 0 4 0 }
t5.c:7:21: note:   node 0x4100b40 (max_nunits=4, refcnt=1)
t5.c:7:21: note:        stmt 0 _157 = _144 * _150;
t5.c:7:21: note:        stmt 1 _155 = _142 * _152;
t5.c:7:21: note:        stmt 2 _153 = _150 * _152;
t5.c:7:21: note:        stmt 3 _145 = _142 * _144;
t5.c:7:21: note:        children 0x4100990 0x4100a50
t5.c:7:21: note:   node 0x4100990 (max_nunits=4, refcnt=1)
t5.c:7:21: note:        stmt 0 _144 = *_143;
t5.c:7:21: note:        stmt 1 _152 = *_151;
t5.c:7:21: note:        stmt 2 _152 = *_151;
t5.c:7:21: note:        stmt 3 _144 = *_143;
t5.c:7:21: note:        load permutation { 3 7 7 3 }
t5.c:7:21: note:   node 0x4100a50 (max_nunits=4, refcnt=1)
t5.c:7:21: note:        stmt 0 _150 = *_149;
t5.c:7:21: note:        stmt 1 _142 = *_141;
t5.c:7:21: note:        stmt 2 _150 = *_149;
t5.c:7:21: note:        stmt 3 _142 = *_141;
t5.c:7:21: note:        load permutation { 7 3 7 3 }

which eventually shows some non-obvious permute optimization opportunities.
I'm currently working on a permute optimization phase btw., but for a start it
only handles cases that do not help here.

Btw, if I use -ffast-math but disable reassociation via -fno-tree-reassoc I get
the reduction chain variant which optimizes to

.L3:
        vmovupd 32(%rsi,%rax), %ymm6
        vmovupd 32(%rdx,%rax), %ymm7
        vmovupd (%rsi,%rax), %ymm5
        vfmadd231pd     (%rdx,%rax), %ymm6, %ymm0
        vfmadd231pd     (%rdx,%rax), %ymm5, %ymm3
        vfmadd231pd     (%rsi,%rax), %ymm7, %ymm1
        addq    $64, %rax
        vfmadd231pd     %ymm6, %ymm7, %ymm2
        cmpq    %rcx, %rax
        jne     .L3

even with GCC 10 (-Ofast -march=core-avx2 -fno-tree-reassoc).  Which means
the following source change helps:

void __attribute__((optimize("no-tree-reassoc"))) cdot(double* res, const double* a, const double* b, int N)
{
  double acc_rere = 0;
  double acc_imim = 0;
  double acc_reim = 0;
  double acc_imre = 0;
  for (int c = 0; c < N; ++c) {
    for (int k = 0; k < 4; ++k) {
      acc_rere += a[c*8+k+0]*b[c*8+k+0];
      acc_imim += a[c*8+k+4]*b[c*8+k+4];
      acc_reim += a[c*8+k+0]*b[c*8+k+4];
      acc_imre += a[c*8+k+4]*b[c*8+k+0];
    }
  }
  res[0] = acc_rere+acc_imim;
  res[4] = acc_imre-acc_reim;
}

the reduction epilogue ends up like

        vextractf128    $0x1, %ymm3, %xmm4
        vaddpd  %xmm3, %xmm4, %xmm3
        vunpckhpd       %xmm3, %xmm3, %xmm4
        vaddpd  %xmm3, %xmm4, %xmm3
        vextractf128    $0x1, %ymm2, %xmm4
        vaddpd  %xmm2, %xmm4, %xmm4
        vunpckhpd       %xmm4, %xmm4, %xmm2
        vaddpd  %xmm4, %xmm2, %xmm2
        vextractf128    $0x1, %ymm1, %xmm4
        vaddpd  %xmm1, %xmm4, %xmm4
        vaddsd  %xmm2, %xmm3, %xmm2
        vunpckhpd       %xmm4, %xmm4, %xmm1
        vaddpd  %xmm4, %xmm1, %xmm1
        vextractf128    $0x1, %ymm0, %xmm4
        vaddpd  %xmm0, %xmm4, %xmm4
        vunpckhpd       %xmm4, %xmm4, %xmm0
        vaddpd  %xmm4, %xmm0, %xmm0
        vsubsd  %xmm1, %xmm0, %xmm0
        vzeroupper
        vmovsd  %xmm2, (%rdi)
        vmovsd  %xmm0, 32(%rdi)

which is not optimal since we miss the opportunity to vectorize the
adds of the accumulators

  res[0] = acc_rere+acc_imim;
  res[4] = acc_imre-acc_reim;
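
Concretely, the epilogue one would hope for does the accumulator adds as full vectors first and only then one horizontal reduction per result; a rough intrinsics fragment of that idea (the acc_* names are placeholders for the four ymm accumulators of the loop):

  __m256d re = _mm256_add_pd(acc_rere, acc_imim);   /* vectorized adds of the accumulators */
  __m256d im = _mm256_sub_pd(acc_imre, acc_reim);
  __m128d re2 = _mm_add_pd(_mm256_castpd256_pd128(re), _mm256_extractf128_pd(re, 1));
  __m128d im2 = _mm_add_pd(_mm256_castpd256_pd128(im), _mm256_extractf128_pd(im, 1));
  res[0] = _mm_cvtsd_f64(_mm_add_sd(re2, _mm_unpackhi_pd(re2, re2)));
  res[4] = _mm_cvtsd_f64(_mm_add_sd(im2, _mm_unpackhi_pd(im2, im2)));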
Comment 2 Michael_S 2020-10-09 11:58:52 UTC
(In reply to Richard Biener from comment #1)
> All below for Part 2.
> 
> Without -ffast-math you are seeing GCC using in-order reductions now while
> with -ffast-math the vectorizer gets a bit confused about reassociations done
> before, for me producing
> 

Just to understand: when you say "Without -ffast-math", does that include my set1 == "-O3 -ffast-math -fno-associative-math"?

BTW, I like your phrasing: "a bit confused about reassociations"


> .L3:
>         vmovupd 32(%rsi,%rax), %ymm3
>         vmovupd (%rdx,%rax), %ymm7
>         vinsertf128     $1, (%rsi,%rax), %ymm3, %ymm0
>         vinsertf128     $1, 32(%rdx,%rax), %ymm7, %ymm2
>         vmovupd 32(%rsi,%rax), %ymm5
>         vpermpd $136, %ymm0, %ymm4
>         vpermpd $40, %ymm2, %ymm7
>         vpermpd $221, %ymm0, %ymm1
>         vpermpd $125, %ymm2, %ymm3
>         vperm2f128      $49, (%rsi,%rax), %ymm5, %ymm0
>         vmovupd (%rdx,%rax), %ymm2
>         vperm2f128      $49, 32(%rdx,%rax), %ymm2, %ymm2
>         addq    $64, %rax
>         vpermpd $136, %ymm0, %ymm5
>         vpermpd $221, %ymm0, %ymm0
>         vpermpd $40, %ymm2, %ymm8
>         vpermpd $125, %ymm2, %ymm2
>         vmulpd  %ymm8, %ymm5, %ymm5
>         vmulpd  %ymm2, %ymm0, %ymm0
>         vfmadd132pd     %ymm3, %ymm5, %ymm1
>         vfmadd231pd     %ymm7, %ymm4, %ymm0
>         vaddpd  %ymm0, %ymm1, %ymm0
>         vaddpd  %ymm0, %ymm6, %ymm6
>         cmpq    %rcx, %rax
>         jne     .L3
> 
> -ffast-math vs. non-ffast-math we're using a SLP reduction vs. 4 reduction
> chains and this SLP reduction ends up looking like
> 
> t5.c:7:21: note:   Vectorizing SLP tree:
> t5.c:7:21: note:   node 0x4100c20 (max_nunits=4, refcnt=2)
> t5.c:7:21: note:        stmt 0 acc_imre_158 = acc_imre_3 + _34;
> t5.c:7:21: note:        stmt 1 acc_reim_156 = acc_reim_1 + _8;
> t5.c:7:21: note:        stmt 2 acc_imim_154 = _21 + acc_imim_35;
> t5.c:7:21: note:        stmt 3 acc_rere_146 = _11 + acc_rere_29;
> t5.c:7:21: note:        children 0x3f272e0 0x4100bb0
> t5.c:7:21: note:   node 0x3f272e0 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 acc_imre_3 = PHI <acc_imre_158(7), 0.0(8)>
> t5.c:7:21: note:        stmt 1 acc_reim_1 = PHI <acc_reim_156(7), 0.0(8)>
> t5.c:7:21: note:        stmt 2 acc_imim_35 = PHI <acc_imim_154(7), 0.0(8)>
> t5.c:7:21: note:        stmt 3 acc_rere_29 = PHI <acc_rere_146(7), 0.0(8)>
> t5.c:7:21: note:        children 0x4100c20
> t5.c:7:21: note:   node 0x4100bb0 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _34 = _36 + _157;
> t5.c:7:21: note:        stmt 1 _8 = _30 + _155;
> t5.c:7:21: note:        stmt 2 _21 = _15 + _153;
> t5.c:7:21: note:        stmt 3 _11 = _6 + _145;
> t5.c:7:21: note:        children 0x4100920 0x4100b40
> t5.c:7:21: note:   node 0x4100920 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _36 = _37 + _73;
> t5.c:7:21: note:        stmt 1 _30 = _32 + _71;
> t5.c:7:21: note:        stmt 2 _15 = _10 + _69;
> t5.c:7:21: note:        stmt 3 _6 = _31 + _61;
> t5.c:7:21: note:        children 0x41004e0 0x41008b0
> t5.c:7:21: note:   node 0x41004e0 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _37 = _101 + _129;
> t5.c:7:21: note:        stmt 1 _32 = _99 + _127;
> t5.c:7:21: note:        stmt 2 _10 = _97 + _125;
> t5.c:7:21: note:        stmt 3 _31 = _89 + _117;
> t5.c:7:21: note:        children 0x3f2a550 0x3f28700
> t5.c:7:21: note:   node 0x3f2a550 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _101 = _88 * _94;
> t5.c:7:21: note:        stmt 1 _99 = _86 * _96;
> t5.c:7:21: note:        stmt 2 _97 = _94 * _96;
> t5.c:7:21: note:        stmt 3 _89 = _86 * _88;
> t5.c:7:21: note:        children 0x40b6990 0x3f29e00
> t5.c:7:21: note:   node 0x40b6990 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _88 = *_87;
> t5.c:7:21: note:        stmt 1 _96 = *_95;
> t5.c:7:21: note:        stmt 2 _96 = *_95;
> t5.c:7:21: note:        stmt 3 _88 = *_87;
> t5.c:7:21: note:        load permutation { 1 5 5 1 }
> t5.c:7:21: note:   node 0x3f29e00 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _94 = *_93;
> t5.c:7:21: note:        stmt 1 _86 = *_85;
> t5.c:7:21: note:        stmt 2 _94 = *_93;
> t5.c:7:21: note:        stmt 3 _86 = *_85;
> t5.c:7:21: note:        load permutation { 5 1 5 1 }
> t5.c:7:21: note:   node 0x3f28700 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _129 = _116 * _122;
> t5.c:7:21: note:        stmt 1 _127 = _114 * _124;
> t5.c:7:21: note:        stmt 2 _125 = _122 * _124;
> t5.c:7:21: note:        stmt 3 _117 = _114 * _116;
> t5.c:7:21: note:        children 0x3f287e0 0x3f28770
> t5.c:7:21: note:   node 0x3f287e0 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _116 = *_115;
> t5.c:7:21: note:        stmt 1 _124 = *_123;
> t5.c:7:21: note:        stmt 2 _124 = *_123;
> t5.c:7:21: note:        stmt 3 _116 = *_115;
> t5.c:7:21: note:        load permutation { 2 6 6 2 }
> t5.c:7:21: note:   node 0x3f28770 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _122 = *_121;
> t5.c:7:21: note:        stmt 1 _114 = *_113;
> t5.c:7:21: note:        stmt 2 _122 = *_121;
> t5.c:7:21: note:        stmt 3 _114 = *_113;
> t5.c:7:21: note:        load permutation { 6 2 6 2 }
> t5.c:7:21: note:   node 0x41008b0 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _73 = _60 * _66;
> t5.c:7:21: note:        stmt 1 _71 = _58 * _68;
> t5.c:7:21: note:        stmt 2 _69 = _66 * _68;
> t5.c:7:21: note:        stmt 3 _61 = _58 * _60;
> t5.c:7:21: note:        children 0x4100290 0x4100810
> t5.c:7:21: note:   node 0x4100290 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _60 = *_59;
> t5.c:7:21: note:        stmt 1 _68 = *_67;
> t5.c:7:21: note:        stmt 2 _68 = *_67;
> t5.c:7:21: note:        stmt 3 _60 = *_59;
> t5.c:7:21: note:        load permutation { 0 4 4 0 }
> t5.c:7:21: note:   node 0x4100810 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _66 = *_65;
> t5.c:7:21: note:        stmt 1 _58 = *_57;
> t5.c:7:21: note:        stmt 2 _66 = *_65;
> t5.c:7:21: note:        stmt 3 _58 = *_57;
> t5.c:7:21: note:        load permutation { 4 0 4 0 }
> t5.c:7:21: note:   node 0x4100b40 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _157 = _144 * _150;
> t5.c:7:21: note:        stmt 1 _155 = _142 * _152;
> t5.c:7:21: note:        stmt 2 _153 = _150 * _152;
> t5.c:7:21: note:        stmt 3 _145 = _142 * _144;
> t5.c:7:21: note:        children 0x4100990 0x4100a50
> t5.c:7:21: note:   node 0x4100990 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _144 = *_143;
> t5.c:7:21: note:        stmt 1 _152 = *_151;
> t5.c:7:21: note:        stmt 2 _152 = *_151;
> t5.c:7:21: note:        stmt 3 _144 = *_143;
> t5.c:7:21: note:        load permutation { 3 7 7 3 }
> t5.c:7:21: note:   node 0x4100a50 (max_nunits=4, refcnt=1)
> t5.c:7:21: note:        stmt 0 _150 = *_149;
> t5.c:7:21: note:        stmt 1 _142 = *_141;
> t5.c:7:21: note:        stmt 2 _150 = *_149;
> t5.c:7:21: note:        stmt 3 _142 = *_141;
> t5.c:7:21: note:        load permutation { 7 3 7 3 }
> 
> which eventually shows some non-obvious permute optimization opportunities.
> I'm currently working on a permute optimization phase btw. but for start
> only handling cases that do not help here.
> 
> Btw, if I use -ffast-math but disable reassociation via -fno-tree-reassoc I
> get
> the reduction chain variant which optimizes to
> 
> .L3:
>         vmovupd 32(%rsi,%rax), %ymm6
>         vmovupd 32(%rdx,%rax), %ymm7
>         vmovupd (%rsi,%rax), %ymm5
>         vfmadd231pd     (%rdx,%rax), %ymm6, %ymm0
>         vfmadd231pd     (%rdx,%rax), %ymm5, %ymm3
>         vfmadd231pd     (%rsi,%rax), %ymm7, %ymm1
>         addq    $64, %rax
>         vfmadd231pd     %ymm6, %ymm7, %ymm2
>         cmpq    %rcx, %rax
>         jne     .L3
> 
> even with GCC 10 (-Ofast -march=core-avx2 -fno-tree-reassoc). 

It's very fragile. I made a tiny (and natural for my real app) change in the source (see below) and the nice MSVC-like inner loop disappeared.

void cdot(double* res, const double* a, const double* b, int N)
{
  double acc_rere = 0;
  double acc_imim = 0;
  double acc_reim = 0;
  double acc_imre = 0;
  for (int c = 0; c < N; ++c) {
    for (int k = 0; k < 4; ++k) {
      acc_rere += a[c*8+k+0]*b[c*8+k+0];
      acc_imim += a[c*8+k+4]*b[c*8+k+4];
      acc_reim -= a[c*8+k+0]*b[c*8+k+4];
      acc_imre += a[c*8+k+4]*b[c*8+k+0];
    }
  }
  res[0] = acc_rere+acc_imim;
  res[4] = acc_reim+acc_imre;
}

> Which means
> the following source change helps:
> 
> void __attribute__((optimize("no-tree-reassoc"))) cdot(double* res, const
> double* a, const double* b, int N)
> {
>   double acc_rere = 0;
>   double acc_imim = 0;
>   double acc_reim = 0;
>   double acc_imre = 0;
>   for (int c = 0; c < N; ++c) {
>     for (int k = 0; k < 4; ++k) {
>       acc_rere += a[c*8+k+0]*b[c*8+k+0];
>       acc_imim += a[c*8+k+4]*b[c*8+k+4];
>       acc_reim += a[c*8+k+0]*b[c*8+k+4];
>       acc_imre += a[c*8+k+4]*b[c*8+k+0];
>     }
>   }
>   res[0] = acc_rere+acc_imim;
>   res[4] = acc_imre-acc_reim;
> }
> 

IMHO, options like "no-tree-reassoc", including and especially within __attribute__((optimize(""))), are for people like you.
People like me, i.e. not compiler guys, wouldn't consider them not only for production but even for a hobby - unless the hobby is compiler research.

Also, several years ago I was told (not by you, Richard Biener, but by "you" as in "gcc maintainers" - more specifically by Manuel López-Ibáñez; you, as Richard Biener, also took part in that discussion but appeared to hold a different opinion, see bug 70255) that __attribute__((optimize(""))) can't be relied on in production. Back then we came to the conclusion that this statement has to be in the official docs. And indeed, since GCC 7 section 6.33.1 contains the following sentences:
"The optimize attribute should be used for debugging purposes only. It is not suitable in production code."
 

> the reduction epilogue ends up like
> 
>         vextractf128    $0x1, %ymm3, %xmm4
>         vaddpd  %xmm3, %xmm4, %xmm3
>         vunpckhpd       %xmm3, %xmm3, %xmm4
>         vaddpd  %xmm3, %xmm4, %xmm3
>         vextractf128    $0x1, %ymm2, %xmm4
>         vaddpd  %xmm2, %xmm4, %xmm4
>         vunpckhpd       %xmm4, %xmm4, %xmm2
>         vaddpd  %xmm4, %xmm2, %xmm2
>         vextractf128    $0x1, %ymm1, %xmm4
>         vaddpd  %xmm1, %xmm4, %xmm4
>         vaddsd  %xmm2, %xmm3, %xmm2
>         vunpckhpd       %xmm4, %xmm4, %xmm1
>         vaddpd  %xmm4, %xmm1, %xmm1
>         vextractf128    $0x1, %ymm0, %xmm4
>         vaddpd  %xmm0, %xmm4, %xmm4
>         vunpckhpd       %xmm4, %xmm4, %xmm0
>         vaddpd  %xmm4, %xmm0, %xmm0
>         vsubsd  %xmm1, %xmm0, %xmm0
>         vzeroupper
>         vmovsd  %xmm2, (%rdi)
>         vmovsd  %xmm0, 32(%rdi)
> 
> which is not optimal since we miss the opportunity to vectorize the
> adds of the accumulators
> 
>   res[0] = acc_rere+acc_imim;
>   res[4] = acc_imre-acc_reim;

The epilogue is a tricky matter.
There are many ways to do it with about the same speed, and which variant is faster could depend on fine microarchitectural details that may differ even between such similar CPUs as Skylake and Zen2, Skylake and Ice Lake, or maybe even Skylake and Skylake-X (well, maybe the last one is an exaggeration). The optimal sequence also depends on the surrounding code, e.g. if I change the source to:
  res[0] = acc_rere+acc_imim;
  res[1] = acc_reim-acc_imre; // results adjacent in memory
it could already be different.
In the latter case it would likely be
  vaddpd  ymm_rere,ymm_imim,ymm_re
  vsubpd  ymm_reim,ymm_imre,ymm_im
  vhadd   ymm_im,ymm_re,ymm_reim
  vperm2f128 $1,    ymm_reim, ymm_reim, ymm_reimH
  vaddpd  xmm2_reimH, xmm_reim, xmm_reim

Even icc can't do it perfectly right now.
It would be nice (for you personally, at least) if in this case gcc generated better code than icc, but it is far [far [far...]] less important than robust handling of the inner loop.
Comment 3 Richard Biener 2020-10-12 13:11:07 UTC
(In reply to Michael_S from comment #2)
> (In reply to Richard Biener from comment #1)
> > All below for Part 2.
> > 
> > Without -ffast-math you are seeing GCC using in-order reductions now while
> > with -ffast-math the vectorizer gets a bit confused about reassociations done
> > before, for me producing
> > 
> 
> Just to understand, when you are saying "Without -ffast-math" does it
> includes my set1 == "-O3 -ffast-math -fno-associative-math" ?

No, -fno-associative-math breaks it again.

> BTW, I like your phrasing: "a bit confused about reassociations"
> 
> > .L3:
> >         vmovupd 32(%rsi,%rax), %ymm3
> >         vmovupd (%rdx,%rax), %ymm7
> >         vinsertf128     $1, (%rsi,%rax), %ymm3, %ymm0
> >         vinsertf128     $1, 32(%rdx,%rax), %ymm7, %ymm2
> >         vmovupd 32(%rsi,%rax), %ymm5
> >         vpermpd $136, %ymm0, %ymm4
> >         vpermpd $40, %ymm2, %ymm7
> >         vpermpd $221, %ymm0, %ymm1
> >         vpermpd $125, %ymm2, %ymm3
> >         vperm2f128      $49, (%rsi,%rax), %ymm5, %ymm0
> >         vmovupd (%rdx,%rax), %ymm2
> >         vperm2f128      $49, 32(%rdx,%rax), %ymm2, %ymm2
> >         addq    $64, %rax
> >         vpermpd $136, %ymm0, %ymm5
> >         vpermpd $221, %ymm0, %ymm0
> >         vpermpd $40, %ymm2, %ymm8
> >         vpermpd $125, %ymm2, %ymm2
> >         vmulpd  %ymm8, %ymm5, %ymm5
> >         vmulpd  %ymm2, %ymm0, %ymm0
> >         vfmadd132pd     %ymm3, %ymm5, %ymm1
> >         vfmadd231pd     %ymm7, %ymm4, %ymm0
> >         vaddpd  %ymm0, %ymm1, %ymm0
> >         vaddpd  %ymm0, %ymm6, %ymm6
> >         cmpq    %rcx, %rax
> >         jne     .L3
> > 
> > -ffast-math vs. non-ffast-math we're using a SLP reduction vs. 4 reduction
> > chains and this SLP reduction ends up looking like
> > 
> > t5.c:7:21: note:   Vectorizing SLP tree:
> > t5.c:7:21: note:   node 0x4100c20 (max_nunits=4, refcnt=2)
> > t5.c:7:21: note:        stmt 0 acc_imre_158 = acc_imre_3 + _34;
> > t5.c:7:21: note:        stmt 1 acc_reim_156 = acc_reim_1 + _8;
> > t5.c:7:21: note:        stmt 2 acc_imim_154 = _21 + acc_imim_35;
> > t5.c:7:21: note:        stmt 3 acc_rere_146 = _11 + acc_rere_29;
> > t5.c:7:21: note:        children 0x3f272e0 0x4100bb0
> > t5.c:7:21: note:   node 0x3f272e0 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note:        stmt 0 acc_imre_3 = PHI <acc_imre_158(7), 0.0(8)>
> > t5.c:7:21: note:        stmt 1 acc_reim_1 = PHI <acc_reim_156(7), 0.0(8)>
> > t5.c:7:21: note:        stmt 2 acc_imim_35 = PHI <acc_imim_154(7), 0.0(8)>
> > t5.c:7:21: note:        stmt 3 acc_rere_29 = PHI <acc_rere_146(7), 0.0(8)>
> > t5.c:7:21: note:        children 0x4100c20
> > t5.c:7:21: note:   node 0x4100bb0 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note:        stmt 0 _34 = _36 + _157;
> > t5.c:7:21: note:        stmt 1 _8 = _30 + _155;
> > t5.c:7:21: note:        stmt 2 _21 = _15 + _153;
> > t5.c:7:21: note:        stmt 3 _11 = _6 + _145;
> > t5.c:7:21: note:        children 0x4100920 0x4100b40
> > t5.c:7:21: note:   node 0x4100920 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note:        stmt 0 _36 = _37 + _73;
> > t5.c:7:21: note:        stmt 1 _30 = _32 + _71;
> > t5.c:7:21: note:        stmt 2 _15 = _10 + _69;
> > t5.c:7:21: note:        stmt 3 _6 = _31 + _61;
> > t5.c:7:21: note:        children 0x41004e0 0x41008b0
> > t5.c:7:21: note:   node 0x41004e0 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note:        stmt 0 _37 = _101 + _129;
> > t5.c:7:21: note:        stmt 1 _32 = _99 + _127;
> > t5.c:7:21: note:        stmt 2 _10 = _97 + _125;
> > t5.c:7:21: note:        stmt 3 _31 = _89 + _117;
> > t5.c:7:21: note:        children 0x3f2a550 0x3f28700
> > t5.c:7:21: note:   node 0x3f2a550 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note:        stmt 0 _101 = _88 * _94;
> > t5.c:7:21: note:        stmt 1 _99 = _86 * _96;
> > t5.c:7:21: note:        stmt 2 _97 = _94 * _96;
> > t5.c:7:21: note:        stmt 3 _89 = _86 * _88;
> > t5.c:7:21: note:        children 0x40b6990 0x3f29e00
> > t5.c:7:21: note:   node 0x40b6990 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note:        stmt 0 _88 = *_87;
> > t5.c:7:21: note:        stmt 1 _96 = *_95;
> > t5.c:7:21: note:        stmt 2 _96 = *_95;
> > t5.c:7:21: note:        stmt 3 _88 = *_87;
> > t5.c:7:21: note:        load permutation { 1 5 5 1 }
> > t5.c:7:21: note:   node 0x3f29e00 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note:        stmt 0 _94 = *_93;
> > t5.c:7:21: note:        stmt 1 _86 = *_85;
> > t5.c:7:21: note:        stmt 2 _94 = *_93;
> > t5.c:7:21: note:        stmt 3 _86 = *_85;
> > t5.c:7:21: note:        load permutation { 5 1 5 1 }
> > t5.c:7:21: note:   node 0x3f28700 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note:        stmt 0 _129 = _116 * _122;
> > t5.c:7:21: note:        stmt 1 _127 = _114 * _124;
> > t5.c:7:21: note:        stmt 2 _125 = _122 * _124;
> > t5.c:7:21: note:        stmt 3 _117 = _114 * _116;
> > t5.c:7:21: note:        children 0x3f287e0 0x3f28770
> > t5.c:7:21: note:   node 0x3f287e0 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note:        stmt 0 _116 = *_115;
> > t5.c:7:21: note:        stmt 1 _124 = *_123;
> > t5.c:7:21: note:        stmt 2 _124 = *_123;
> > t5.c:7:21: note:        stmt 3 _116 = *_115;
> > t5.c:7:21: note:        load permutation { 2 6 6 2 }
> > t5.c:7:21: note:   node 0x3f28770 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note:        stmt 0 _122 = *_121;
> > t5.c:7:21: note:        stmt 1 _114 = *_113;
> > t5.c:7:21: note:        stmt 2 _122 = *_121;
> > t5.c:7:21: note:        stmt 3 _114 = *_113;
> > t5.c:7:21: note:        load permutation { 6 2 6 2 }
> > t5.c:7:21: note:   node 0x41008b0 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note:        stmt 0 _73 = _60 * _66;
> > t5.c:7:21: note:        stmt 1 _71 = _58 * _68;
> > t5.c:7:21: note:        stmt 2 _69 = _66 * _68;
> > t5.c:7:21: note:        stmt 3 _61 = _58 * _60;
> > t5.c:7:21: note:        children 0x4100290 0x4100810
> > t5.c:7:21: note:   node 0x4100290 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note:        stmt 0 _60 = *_59;
> > t5.c:7:21: note:        stmt 1 _68 = *_67;
> > t5.c:7:21: note:        stmt 2 _68 = *_67;
> > t5.c:7:21: note:        stmt 3 _60 = *_59;
> > t5.c:7:21: note:        load permutation { 0 4 4 0 }
> > t5.c:7:21: note:   node 0x4100810 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note:        stmt 0 _66 = *_65;
> > t5.c:7:21: note:        stmt 1 _58 = *_57;
> > t5.c:7:21: note:        stmt 2 _66 = *_65;
> > t5.c:7:21: note:        stmt 3 _58 = *_57;
> > t5.c:7:21: note:        load permutation { 4 0 4 0 }
> > t5.c:7:21: note:   node 0x4100b40 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note:        stmt 0 _157 = _144 * _150;
> > t5.c:7:21: note:        stmt 1 _155 = _142 * _152;
> > t5.c:7:21: note:        stmt 2 _153 = _150 * _152;
> > t5.c:7:21: note:        stmt 3 _145 = _142 * _144;
> > t5.c:7:21: note:        children 0x4100990 0x4100a50
> > t5.c:7:21: note:   node 0x4100990 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note:        stmt 0 _144 = *_143;
> > t5.c:7:21: note:        stmt 1 _152 = *_151;
> > t5.c:7:21: note:        stmt 2 _152 = *_151;
> > t5.c:7:21: note:        stmt 3 _144 = *_143;
> > t5.c:7:21: note:        load permutation { 3 7 7 3 }
> > t5.c:7:21: note:   node 0x4100a50 (max_nunits=4, refcnt=1)
> > t5.c:7:21: note:        stmt 0 _150 = *_149;
> > t5.c:7:21: note:        stmt 1 _142 = *_141;
> > t5.c:7:21: note:        stmt 2 _150 = *_149;
> > t5.c:7:21: note:        stmt 3 _142 = *_141;
> > t5.c:7:21: note:        load permutation { 7 3 7 3 }
> > 
> > which eventually shows some non-obvious permute optimization opportunities.
> > I'm currently working on a permute optimization phase btw. but for start
> > only handling cases that do not help here.
> > 
> > Btw, if I use -ffast-math but disable reassociation via -fno-tree-reassoc I
> > get
> > the reduction chain variant which optimizes to
> > 
> > .L3:
> >         vmovupd 32(%rsi,%rax), %ymm6
> >         vmovupd 32(%rdx,%rax), %ymm7
> >         vmovupd (%rsi,%rax), %ymm5
> >         vfmadd231pd     (%rdx,%rax), %ymm6, %ymm0
> >         vfmadd231pd     (%rdx,%rax), %ymm5, %ymm3
> >         vfmadd231pd     (%rsi,%rax), %ymm7, %ymm1
> >         addq    $64, %rax
> >         vfmadd231pd     %ymm6, %ymm7, %ymm2
> >         cmpq    %rcx, %rax
> >         jne     .L3
> > 
> > even with GCC 10 (-Ofast -march=core-avx2 -fno-tree-reassoc). 
> 
> It's very fragile. I made a tiny (and natural for my real app) change in the
> source (see below) and the nice MSVC-like inner loop disappeared.
> 
> void cdot(double* res, const double* a, const double* b, int N)
> {
>   double acc_rere = 0;
>   double acc_imim = 0;
>   double acc_reim = 0;
>   double acc_imre = 0;
>   for (int c = 0; c < N; ++c) {
>     for (int k = 0; k < 4; ++k) {
>       acc_rere += a[c*8+k+0]*b[c*8+k+0];
>       acc_imim += a[c*8+k+4]*b[c*8+k+4];
>       acc_reim -= a[c*8+k+0]*b[c*8+k+4];
>       acc_imre += a[c*8+k+4]*b[c*8+k+0];
>     }
>   }
>   res[0] = acc_rere+acc_imim;
>   res[4] = acc_reim+acc_imre;
> }

wow, OK ... :/

> > Which means
> > the following source change helps:
> > 
> > void __attribute__((optimize("no-tree-reassoc"))) cdot(double* res, const
> > double* a, const double* b, int N)
> > {
> >   double acc_rere = 0;
> >   double acc_imim = 0;
> >   double acc_reim = 0;
> >   double acc_imre = 0;
> >   for (int c = 0; c < N; ++c) {
> >     for (int k = 0; k < 4; ++k) {
> >       acc_rere += a[c*8+k+0]*b[c*8+k+0];
> >       acc_imim += a[c*8+k+4]*b[c*8+k+4];
> >       acc_reim += a[c*8+k+0]*b[c*8+k+4];
> >       acc_imre += a[c*8+k+4]*b[c*8+k+0];
> >     }
> >   }
> >   res[0] = acc_rere+acc_imim;
> >   res[4] = acc_imre-acc_reim;
> > }
> > 
> 
> IMHO, options like "no-tree-reassoc", including and especially within
> __attribute__((optimize("")))  are for people like you.
> People like me, i.e. not compiler guys, don't consider them not only for
> production, but even for hobby. Unless a hobby is compiler research.

That's true.

> Also, several years ago I was told by (not you, as Richard Biener, but "you"
> as "gcc maintainers", more specifically, by Manuel López-Ibáñez. You, as
> Richard Biener, also took part in discussion, but appeared to hold different
> opinion, see 70255) told me that __attribute__((optimize(""))) can't be
> relied on in production. Back than we came to conclusion that this statement
> has to be in official docs. And indeed since GCC7 section 6.33.1 contains
> the folowing sentences:
> "The optimize attribute should be used for debugging purposes only. It is
> not suitable in production code."

Yeah, I guess from my HPC background I'd say HPC is never "production" ;)
I would indeed avoid optimize("") in, say, a github-hosted project, but
IMHO it's as valid as splitting out the relevant function and compiling
the TU with -fno-tree-reassoc - it's tuning the setup for max performance
with a specific compiler and for a specific host.
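
For example (reusing the flags from this report), splitting the kernel into its own translation unit and building just that file with something like

  gcc -O3 -ffast-math -fno-tree-reassoc -mavx2 -mfma -march=skylake -c cdot.c

keeps the rest of the project on the default options.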

> 
> > the reduction epilogue ends up like
> > 
> >         vextractf128    $0x1, %ymm3, %xmm4
> >         vaddpd  %xmm3, %xmm4, %xmm3
> >         vunpckhpd       %xmm3, %xmm3, %xmm4
> >         vaddpd  %xmm3, %xmm4, %xmm3
> >         vextractf128    $0x1, %ymm2, %xmm4
> >         vaddpd  %xmm2, %xmm4, %xmm4
> >         vunpckhpd       %xmm4, %xmm4, %xmm2
> >         vaddpd  %xmm4, %xmm2, %xmm2
> >         vextractf128    $0x1, %ymm1, %xmm4
> >         vaddpd  %xmm1, %xmm4, %xmm4
> >         vaddsd  %xmm2, %xmm3, %xmm2
> >         vunpckhpd       %xmm4, %xmm4, %xmm1
> >         vaddpd  %xmm4, %xmm1, %xmm1
> >         vextractf128    $0x1, %ymm0, %xmm4
> >         vaddpd  %xmm0, %xmm4, %xmm4
> >         vunpckhpd       %xmm4, %xmm4, %xmm0
> >         vaddpd  %xmm4, %xmm0, %xmm0
> >         vsubsd  %xmm1, %xmm0, %xmm0
> >         vzeroupper
> >         vmovsd  %xmm2, (%rdi)
> >         vmovsd  %xmm0, 32(%rdi)
> > 
> > which is not optimal since we miss the opportunity to vectorize the
> > adds of the accumulators
> > 
> >   res[0] = acc_rere+acc_imim;
> >   res[4] = acc_imre-acc_reim;
> 
> Epilogue is a tricky matter.
> There are many ways to do it with about the same speed and which variant is
> faster could depend on fine microarchitectural details, that could be not
> the same even on such quite similar CPUs as Skylake and Zen2 or Skylake and
> Ice Lake, or may be even Skylake and Skylake-X (well, may be the last one is
> an  exaggeration). The optimal sequence also depends on surrounding, e.g. if
> I change the source to:
>   res[0] = acc_rere+acc_imim;
>   res[1] = acc_reim-acc_imre; // results adjacent in memory
> It could already be different.
> In later case it likely would be
>   vaddpd  ymm_rere,ymm_imim,ymm_re
>   vsubpd  ymm_reim,ymm_imre,ymm_im
>   vhadd   ymm_im,ymm_re,ymm_reim
>   vperm2f128 $1,    ymm_reim, ymm_reim, ymm_reimH
>   vaddpd  xmm2_reimH, xmm_reim, xmm_reim
> 
> Even icc can't do it perfectly right now. 
> It would be nice (for you personally, at least) if in this case gcc will
> generate better code than icc, but it is far [far [far...]] less important
> than robust handling of inner loop.

Yeah, the issue really is that with reassociation we miss the obviously
better reduction scheme and the vectorizer cannot translate between one
and the other at the moment (it doesn't itself try reassociating other than
that implied by vectorizing).  We could try to add this capability without
too many issues I guess - the main problem will be that it's a decision
made upfront about the reduction scheme and that it wouldn't be one based on
actual costs.

Thanks for the insightful loop kernels btw!