Created attachment 44287: Source demonstrating poor optimization

The attached code produces lots of needless spills with gcc-8.1 on Linux.

$ gcc -O3 -Wall -march=skylake -ffast-math -fopenmp -c mm-gcc.c

0000000000000080 <mult+0x80> vbroadcastsd ymm2,QWORD PTR [rdx]
0000000000000085 <mult+0x85> vmovupd ymm1,YMMWORD PTR [rcx]
0000000000000089 <mult+0x89> vmovapd ymm0,ymm1
000000000000008d <mult+0x8d> vfmadd213pd ymm0,ymm2,YMMWORD PTR [rsp]
0000000000000093 <mult+0x93> vmovapd YMMWORD PTR [rsp],ymm0
0000000000000098 <mult+0x98> vmovupd ymm0,YMMWORD PTR [rcx+0x20]
000000000000009d <mult+0x9d> vfmadd213pd ymm2,ymm0,YMMWORD PTR [rsp+0x20]
00000000000000a4 <mult+0xa4> vmovapd YMMWORD PTR [rsp+0x20],ymm2
00000000000000aa <mult+0xaa> vbroadcastsd ymm2,QWORD PTR [rdx+0x400]
00000000000000b3 <mult+0xb3> vmovapd ymm3,ymm1
00000000000000b7 <mult+0xb7> vfmadd213pd ymm3,ymm2,YMMWORD PTR [rsp+0x40]
00000000000000be <mult+0xbe> vmovapd YMMWORD PTR [rsp+0x40],ymm3
00000000000000c4 <mult+0xc4> vfmadd213pd ymm2,ymm0,YMMWORD PTR [rsp+0x60]
00000000000000cb <mult+0xcb> vmovapd YMMWORD PTR [rsp+0x60],ymm2
00000000000000d1 <mult+0xd1> vbroadcastsd ymm2,QWORD PTR [rdx+0x800]
00000000000000da <mult+0xda> vmovapd ymm3,ymm1
00000000000000de <mult+0xde> vfmadd213pd ymm3,ymm2,YMMWORD PTR [rsp+0x80]
00000000000000e8 <mult+0xe8> vmovapd YMMWORD PTR [rsp+0x80],ymm3
00000000000000f1 <mult+0xf1> vfmadd213pd ymm2,ymm0,YMMWORD PTR [rsp+0xa0]
00000000000000fb <mult+0xfb> vmovapd YMMWORD PTR [rsp+0xa0],ymm2
0000000000000104 <mult+0x104> vbroadcastsd ymm2,QWORD PTR [rdx+0xc00]
000000000000010d <mult+0x10d> vfmadd213pd ymm1,ymm2,YMMWORD PTR [rsp+0xc0]
0000000000000117 <mult+0x117> vmovapd YMMWORD PTR [rsp+0xc0],ymm1
0000000000000120 <mult+0x120> vfmadd213pd ymm0,ymm2,YMMWORD PTR [rsp+0xe0]
000000000000012a <mult+0x12a> vmovapd YMMWORD PTR [rsp+0xe0],ymm0
0000000000000133 <mult+0x133> add rdx,0x8
0000000000000137 <mult+0x137> add rcx,0x400
000000000000013e <mult+0x13e> cmp rsi,rcx
0000000000000141 <mult+0x141> jne 0000000000000080 <mult+0x80>

GCC does not issue vector instructions if omp simd is removed.

In contrast, clang-6 vectorizes well with or without omp simd:

$ clang -O3 -Wall -march=haswell -ffast-math -c mm-gcc.c

00000000000000e0 <mult+0xe0> vmovapd ymm9,ymm6
00000000000000e4 <mult+0xe4> vbroadcastsd ymm10,QWORD PTR [rdi+rbx*8-0x800]
00000000000000ee <mult+0xee> vmovupd ymm6,YMMWORD PTR [rax-0x20]
00000000000000f3 <mult+0xf3> vmovupd ymm11,YMMWORD PTR [rax]
00000000000000f7 <mult+0xf7> vfmadd231pd ymm1,ymm6,ymm10
00000000000000fc <mult+0xfc> vfmadd231pd ymm7,ymm11,ymm10
0000000000000101 <mult+0x101> vbroadcastsd ymm10,QWORD PTR [rdi+rbx*8-0x400]
000000000000010b <mult+0x10b> vfmadd231pd ymm8,ymm6,ymm10
0000000000000110 <mult+0x110> vfmadd231pd ymm5,ymm11,ymm10
0000000000000115 <mult+0x115> vbroadcastsd ymm10,QWORD PTR [rdi+rbx*8]
000000000000011b <mult+0x11b> vfmadd231pd ymm2,ymm6,ymm10
0000000000000120 <mult+0x120> vfmadd231pd ymm3,ymm11,ymm10
0000000000000125 <mult+0x125> vbroadcastsd ymm10,QWORD PTR [rdi+rbx*8+0x400]
000000000000012f <mult+0x12f> vfmadd213pd ymm6,ymm10,ymm9
0000000000000134 <mult+0x134> vfmadd231pd ymm4,ymm11,ymm10
0000000000000139 <mult+0x139> add rax,0x400
000000000000013f <mult+0x13f> add rbx,0x1
0000000000000143 <mult+0x143> jne 00000000000000e0 <mult+0xe0>

(I used -march=haswell instead of -march=skylake due to https://bugs.llvm.org/show_bug.cgi?id=37819.)
It might be useful to note that what the testcase "wants" to happen is for the compiler to notice that the temporary array 'double C[Si][Sk]' does not need to live in memory - ideally it would correspond to 8 256-bit (or 4 512-bit) registers.
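For readers without the attachment at hand, here is a minimal sketch of the kind of kernel being discussed. It is not the attached source. The name tile_mult and the parameters len, lda, ldb and ldc are hypothetical, and the tile shape Si=4, Sk=8 is only inferred from the four broadcasts and two 256-bit loads per iteration in the listings above. The point is that C[][] is a purely local accumulator: it is written in the inner loop and copied out once at the end, so it could in principle stay in registers throughout.

```c
#include <stddef.h>

/* Assumed tile shape: 4 x 8 doubles = 32 doubles = 8 ymm registers. */
enum { Si = 4, Sk = 8 };

/* Hypothetical kernel: computes one Si x Sk tile of c from an Si x len
 * tile of a and a len x Sk tile of b (all row-major, with leading
 * dimensions lda, ldb, ldc). */
void tile_mult(size_t len, size_t lda, size_t ldb, size_t ldc,
               const double *restrict a, const double *restrict b,
               double *restrict c)
{
    /* Local accumulator the compiler would ideally keep in registers. */
    double C[Si][Sk] = {{0.0}};

    for (size_t j = 0; j < len; j++) {
        for (int ii = 0; ii < Si; ii++) {
            double aij = a[ii * lda + j];   /* becomes a broadcast */
            #pragma omp simd
            for (int kk = 0; kk < Sk; kk++)
                C[ii][kk] += aij * b[j * ldb + kk];
        }
    }

    /* Copy-out: the only point where the tile has to reach memory. */
    for (int ii = 0; ii < Si; ii++) {
        #pragma omp simd
        for (int kk = 0; kk < Sk; kk++)
            c[ii * ldc + kk] = C[ii][kk];
    }
}
```

In the GCC listing above, every FMA reads its accumulator from [rsp+...] and is followed by a store back to the same slot, which is what this accumulator looks like when it stays in memory.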
Confirmed. There are two things in the way. First, we transform the

  #pragma omp simd
  for (int kk=0; kk<Sk; kk++) { c[(i+ii)*p+k+kk] = C[ii][kk]; }

loop into a memcpy (as a hack, we could simply avoid that for force_vectorize loops). Second, if we avoid that, for example with -fno-tree-loop-distribute-patterns, we then fail to elide the stores to C[]. That happens because unrolling doesn't preserve restrict info, and when vectorization makes C addressable it doesn't update the restrict info to reflect that C doesn't alias with anything. We also do not have a late enough scalarization pass that would elide the array; we would rely on LIM/DSE here.
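To look at the first obstacle in isolation, a reduced copy-out loop along these lines may be enough. This is a sketch under assumptions, not a verified reproducer: store_tile, ldc and the file name copy.c are hypothetical, and whether a memcpy call actually appears will depend on the GCC version and flags.

```c
#include <stddef.h>

/* Assumed tile shape, matching 8 x 256-bit registers of doubles. */
enum { Si = 4, Sk = 8 };

/* Hypothetical reduction of the copy-out step: each inner loop is a
 * contiguous copy of Sk doubles, which is the shape that loop
 * distribution's pattern matching can recognize as a memcpy.
 *
 * One possible comparison (copy.c is a hypothetical file name):
 *   gcc -O3 -march=skylake -fopenmp -S copy.c
 *   gcc -O3 -march=skylake -fopenmp -fno-tree-loop-distribute-patterns -S copy.c
 */
void store_tile(double C[Si][Sk], double *restrict c, size_t ldc)
{
    for (int ii = 0; ii < Si; ii++) {
        #pragma omp simd
        for (int kk = 0; kk < Sk; kk++)
            c[ii * ldc + kk] = C[ii][kk];
    }
}
```

Note that in this reduced form C is a parameter and therefore has to live in memory anyway, so it can only illustrate the memcpy transformation. The second obstacle, eliding the stores to a local C[], needs the accumulator to be declared inside the function as in the original testcase.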